Quantitative notions of relevance: Data analysis

Author

Alex Warstadt, Omar Aghar & Michael Franke

First load required packages and set some global parameters.

Code
library(tidyverse)
library(brms)
library(tidyboot)
library(tidyjson)
library(patchwork)
library(GGally)
library(cowplot)
library(BayesFactor)
library(aida)   # custom helpers: https://github.com/michael-franke/aida-package
library(faintr) # custom helpers: https://michael-franke.github.io/faintr/index.html
library(cspplot) # custom styles: https://github.com/CogSciPrag/cspplot
library(ordbetareg) # for ordered beta-regression
library(cmdstanr)

##################################################

# these options help Stan run faster
options(mc.cores = parallel::detectCores(),
        brms.backend = "cmdstanr")

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors() |> pull(hex)

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
  scale_fill_manual(..., values = project_colors)
}

##################################################

rerun_models <- FALSE

Lambert_test <- function(loo_comp) {
  # z-score test on the output of loo_compare():
  # relates the elpd difference in row 2 to its standard error
  1 - pnorm(-loo_comp[2,1], loo_comp[2,2])
}

##################################################
rl_file <- "R_data_4_TeX/myvars.csv"
myvars = list()
# add with: myvars[name] = value

Read & massage the data

The data is pre-processed (using Python). We load it and rearrange for convenience.

Code
d <- read_csv("../results/round_2.0/results_preprocessed.csv") |> 
  # drop column with numbers
  select(-`...1`) |> 
  # set "non-answers" to AnswerPolarity "positive"
  mutate(AnswerPolarity = ifelse(
    AnswerCertainty == "non_answer", 
    "positive", 
    AnswerPolarity)) |> 
  # casting into factor
  mutate(
    group = factor(group, levels = c("relevant", "helpful")),
    ContextType = factor(ContextType, 
                         levels = c("negative", "neutral", "positive")),
    AnswerPolarity = factor(AnswerPolarity, 
                            levels = c("positive", "negative")),
    AnswerCertainty = factor(AnswerCertainty, 
                             levels = c("non_answer", "low_certainty", 
                                        "high_certainty", "exhaustive"))
  ) 

Data exclusion

We exclude all data from participants who:

  • did not score perfectly on all attention checks,
  • scored less than 0.5 on the reasoning tasks, or
  • had a task sensitivity of at most 0.75.

Task sensitivity is the proportion of critical trials, excluding non-answer trials, in which the change between the prior and the posterior rating was at least 0.05 or there was a non-zero change in the commitment rating.

We first add the task-sensitivity score:

Code
d <- d |> as_tibble() |> 
  mutate(
    answer_class  = ifelse(AnswerCertainty != "non_answer", "answer", "non_answer"),
    # belief change: slider moved by at least 0.05 or the commitment rating changed
    belief_change = abs(prior_sliderResponse - posterior_sliderResponse) >= 0.05 |
      prior_confidence != posterior_confidence,
    # a trial is 'deviant' if it is a proper answer without any belief change
    deviant = case_when(answer_class == "answer" ~ !belief_change,
                        answer_class != "answer" ~ FALSE)
  ) |> 
  group_by(submission_id) |> 
  mutate(task_sensitivity = 1 - sum(deviant) / sum(answer_class == "answer")) |> 
  select(-answer_class, -belief_change, -deviant)

Apply exclusion criteria:
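The filtering step itself is not shown above; here is a minimal sketch of what it could look like, assuming the preprocessed data contains per-participant scores in columns named attention_check_score and reasoning_score (hypothetical column names, not necessarily those used in the actual data).

Code
# Sketch only: attention_check_score and reasoning_score are assumed column names.
d <- d |>
  filter(
    attention_check_score == 1,  # perfect on all attention checks
    reasoning_score >= 0.5,      # at least 0.5 on the reasoning tasks
    task_sensitivity > 0.75      # task sensitivity above 0.75
  )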

Exploring the effect of the experimental factors

The experiment had the following factors:

  • group: whether a participant rated the ‘helpfulness’ or the ‘relevance’ of the answer (between-subjects variable)
  • ContextType: whether the context made a ‘no’ or a ‘yes’ answer more likely a priori, or whether it was neutral (within-subjects)
  • AnswerCertainty: how much information the answer provides towards a fully resolving answer (within-subjects)
  • AnswerPolarity: whether the answer suggests or implies a ‘no’ or a ‘yes’ answer (within-subjects)
    • ‘non-answers’ are treated as ‘positive’ for technical purposes, but this does not influence relevant analyses

In the following, we first check whether these experimental manipulations worked as intended.

Sanity-checking whether the manipulations worked as intended

Effects of ContextType on prior and prior commitment

To check whether the ContextType manipulation worked, we compare how participants rated the prior probability of a ‘yes’ answer under each level of the ContextType variable. Concretely, we expect this order of prior ratings for the levels of ContextType: negative < neutral < positive. Although we have no specific hypotheses or sanity-checking questions regarding the commitment ratings, let’s also scrutinize the commitment ratings that participants gave with their prior ratings.

Prior ratings as a function of ContextType

Here is a first plot addressing the question of whether ContextType affects participants’ prior ratings.

Code
d |> ggplot(aes(x = prior_sliderResponse, color = ContextType, fill = ContextType)) +
  geom_density(alpha = 0.3) + 
  xlab("prior rating") +
  ylab("")

Code
ggsave("plots/sanity_check_context_type_prior.pdf", scale = 1.2, height = 4, width = 5)

We dive deeper by fitting a regression model that predicts prior ratings in terms of ContextType. Since participants have not yet seen the answer when they rate the prior probability of a ‘yes’ answer, ContextType is the only fixed effect we should include here. The model also includes the maximal RE structure. We use the ordbetareg package for ordered beta regression, which is appropriate for bounded slider data with responses at the scale endpoints.

Code
if (rerun_models) {
  fit_contextType_SC <- ordbetareg::ordbetareg(
    prior_sliderResponse ~ ContextType + 
      (1 + ContextType | StimID) + 
      (1 + ContextType | submission_id),
    data = d,
    iter = 3000 # ensure sufficient ESS 
  )
  saveRDS(fit_contextType_SC, "cachedModels-round2/fit_contextType_SC.Rds")
} else{
  fit_contextType_SC <- readRDS("cachedModels-round2/fit_contextType_SC.Rds")
}

Our assumption is that prior ratings are higher in contexts classified as ‘neutral’ than in ‘negative’ contexts, and yet higher in ‘positive’ contexts. We use the faintr package to extract information on these directed comparisons.

Code
ContextType_SC_negVSneu <- 
  compare_groups(
    fit_contextType_SC, 
    lower = ContextType == "negative",
    higher = ContextType == "neutral" 
  )
ContextType_SC_neuVSpos <- 
  compare_groups(
    fit_contextType_SC, 
    lower = ContextType == "neutral",
    higher = ContextType == "positive"
  )

Prior commitment as a function of ContextType

Here is a visualization of the effect of ContextType on participants’ commitment in their prior ratings.

Code
d |> mutate(prior_confidence = factor(prior_confidence, levels = 1:7)) |> 
  ggplot(aes(x = prior_confidence, color = ContextType, fill = ContextType)) +
  geom_histogram(position = "dodge", stat='count') + 
  xlab("prior commitment") +
  ylab("")

Code
ggsave("plots/sanity_check_context_type_confidence.pdf", scale = 1.2, height = 4, width = 5)

To scrutinize the effect of ContextType on participants’ expressed commitment in their prior ratings, we use an ordered-logit regression (since prior commitment ratings come from a rating scale).

Code
if (rerun_models) {
  fit_contextType_SC_conf <- brm(
    prior_confidence ~ ContextType + 
      (1 + ContextType | StimID) + 
      (1 + ContextType | submission_id),
    data = d,
    family = cumulative(),
    iter = 3000
  )
  saveRDS(fit_contextType_SC_conf, "cachedModels-round2/fit_contextType_SC_conf.Rds")
} else{
  fit_contextType_SC_conf <- readRDS("cachedModels-round2/fit_contextType_SC_conf.Rds")
}
Code
ContextType_SC_neuVSneg_conf <- 
  compare_groups(
    fit_contextType_SC_conf, 
    lower = ContextType == "neutral",
    higher = ContextType == "negative" 
  )
ContextType_SC_negVSpos_conf <- 
  compare_groups(
    fit_contextType_SC_conf, 
    lower = ContextType == "negative",
    higher = ContextType == "positive"
  )
ContextType_SC_neuVSpos_conf <- 
  compare_groups(
    fit_contextType_SC_conf, 
    lower = ContextType == "neutral",
    higher = ContextType == "positive"
  )

Results

The results of these comparisons are summarized here:

comparison            measure             posterior   HDI-low     HDI-high
negative < neutral    prior               0.9973333    0.292325   1.134670
neutral < positive    prior               0.9998333    0.507056   1.558420
neutral < negative    prior-commitment    0.9891667    0.131985   1.284680
negative < positive   prior-commitment    0.7971667   -0.505641   1.205520
neutral < positive    prior-commitment    0.9958333    0.347629   1.798132

The ContextType manipulation seems to have worked as expected for the prior ratings: lower in ‘negative’ than in ‘neutral’ than in ‘positive’ contexts. For the commitment ratings, both the ‘negative’ and the ‘positive’ context seem to induce more commitment than the ‘neutral’ context, while there is no support for a difference between the ‘negative’ and the ‘positive’ context.

Effects of AnswerPolarity and AnswerCertainty on beliefChange

We define beliefChange as the difference between posterior and prior ratings, signed in the direction expected from the answer’s polarity: the posterior belief in a ‘yes’ answer should increase relative to the prior rating for a ‘positive’ answer, and decrease for a ‘negative’ answer. (Careful: we ignore non-answers here, which are categorized as “positive” for technical convenience only.) If our manipulation worked, we expect the following for both ‘positive’ and ‘negative’ polarity:

  1. beliefChange is > 0
  2. beliefChange is lower for ‘low certainty’ than for ‘high certainty’ than for ‘exhaustive’

Here is a plot visually addressing these issues:

Code
d |> filter(AnswerCertainty != "non_answer") |> 
  mutate(beliefChange = posterior_sliderResponse - prior_sliderResponse,
         beliefChange = ifelse(AnswerPolarity == "positive", beliefChange, - beliefChange)) |> 
  mutate(AnswerCertainty = case_when(AnswerCertainty == "low_certainty" ~ "low",
                                     AnswerCertainty == "high_certainty" ~ "high",
                                     TRUE ~ "exhaustive") |> 
           factor(levels = c("low", "high", "exhaustive"))) |> 
  ggplot(aes(x = beliefChange, color = AnswerCertainty, fill = AnswerCertainty)) +
  geom_density(alpha = 0.2, linewidth=2) +
  facet_grid(. ~ AnswerPolarity) +
  xlab("belief change (in expected direction)") +
  ylab("") + theme_aida()

Code
ggsave("plots/belief-change.pdf", scale = 1.2, height = 4, width = 10)

beliefChange is positive

To address the first issue, whether beliefChange is positive for both types of polarity, we first regress beliefChange against the full list of potentially relevant factors, including plausible RE structures. Notice that, at the time of answering the questions related to the posterior, participants have not yet seen the question about relevance or helpfulness, so the factor group should be omitted.

Code
# TODO: strictly speaking, this is data from a bounded scale; a different regression model would be appropriate
if (rerun_models) {
  fit_answer_SC <- brm(
    formula = beliefChange ~ ContextType * AnswerCertainty * AnswerPolarity +
      (1 + ContextType + AnswerCertainty + AnswerPolarity | StimID) +
      (1 + ContextType + AnswerCertainty + AnswerPolarity | submission_id),
    data = d |> filter(AnswerCertainty != "non_answer") |> 
      mutate(beliefChange = posterior_sliderResponse - prior_sliderResponse,
             beliefChange = ifelse(AnswerPolarity == "positive", beliefChange, - beliefChange))
  )
  saveRDS(fit_answer_SC, "cachedModels-round2/fit_answer_SC.Rds")
} else{
  fit_answer_SC <- readRDS("cachedModels-round2/fit_answer_SC.Rds")
}

We check if inferred cell means are credibly bigger than zero, for all six relevant design cells (facets in the plot above).

Code
# 1. Check if belief change in each cell is bigger than zero
cellDraws_answers <- tibble(
  low_pos  = filter_cell_draws(
    fit_answer_SC, AnswerCertainty == "low_certainty"  & AnswerPolarity == "positive", 
    "low_pos")$low_pos,
  high_pos = filter_cell_draws(
    fit_answer_SC, AnswerCertainty == "high_certainty" & AnswerPolarity == "positive", 
    "high_pos")$high_pos,
  exh_pos  = filter_cell_draws(
    fit_answer_SC, AnswerCertainty == "exhaustive"     & AnswerPolarity == "positive", 
    "exh_pos")$exh_pos,
  low_neg  = filter_cell_draws(
    fit_answer_SC, AnswerCertainty == "low_certainty"  & AnswerPolarity == "negative", 
    "low_neg")$low_neg,
  high_neg = filter_cell_draws(
    fit_answer_SC, AnswerCertainty == "high_certainty" & AnswerPolarity == "negative", 
    "high_neg")$high_neg,
  exh_neg  = filter_cell_draws(
    fit_answer_SC, AnswerCertainty == "exhaustive"     & AnswerPolarity == "negative", 
    "exh_neg")$exh_neg
) 

# all posterior 95% HDIs are way above 0 
apply( cellDraws_answers |> as.matrix(), MARGIN = 2, aida::summarize_sample_vector)
$low_pos
# A tibble: 1 × 4
  Parameter `|95%`  mean `95%|`
  <chr>      <dbl> <dbl>  <dbl>
1 ""        0.0861 0.131  0.176

$high_pos
# A tibble: 1 × 4
  Parameter `|95%`  mean `95%|`
  <chr>      <dbl> <dbl>  <dbl>
1 ""         0.202 0.257  0.306

$exh_pos
# A tibble: 1 × 4
  Parameter `|95%`  mean `95%|`
  <chr>      <dbl> <dbl>  <dbl>
1 ""         0.396 0.447  0.499

$low_neg
# A tibble: 1 × 4
  Parameter `|95%`  mean `95%|`
  <chr>      <dbl> <dbl>  <dbl>
1 ""         0.105 0.177  0.250

$high_neg
# A tibble: 1 × 4
  Parameter `|95%`  mean `95%|`
  <chr>      <dbl> <dbl>  <dbl>
1 ""         0.206 0.282  0.356

$exh_neg
# A tibble: 1 × 4
  Parameter `|95%`  mean `95%|`
  <chr>      <dbl> <dbl>  <dbl>
1 ""         0.301 0.363  0.428
Code
# posterior probability of mean bigger 0 for each cell is almost 1 everywhere
apply(as.matrix(cellDraws_answers), MARGIN=2, function(x) {mean(x>0)})
 low_pos high_pos  exh_pos  low_neg high_neg  exh_neg 
       1        1        1        1        1        1 

These results suggest that there is little reason to doubt that the belief changes induced by the answers (as per the experimentally intended manipulation) went in the right direction in all cases.

beliefChange increases with more informative answers

Finally, we investigate whether beliefChange increases with more informative answers, using the same regression model as before.

Code
AnswerPolarity_main <- compare_groups(
  fit_answer_SC,
  lower = AnswerPolarity == "positive",
  higher  = AnswerPolarity == "negative"
)

AnswerCertainty_lowVShigh <- compare_groups(
  fit_answer_SC,
  lower   = AnswerCertainty == "low_certainty",
  higher  = AnswerCertainty == "high_certainty"
)

AnswerCertainty_highVSexh <- compare_groups(
  fit_answer_SC,
  lower   = AnswerCertainty == "high_certainty",
  higher  = AnswerCertainty == "exhaustive"
)
comparison              measure         posterior   HDI (low)    HDI (high)
pos vs neg polarity     belief change   0.44000     -0.0708233   0.0726474
low vs high certainty   belief change   1.00000      0.0671545   0.1705422
high certain vs exh     belief change   0.99925      0.0713969   0.2049482

We see no indication of a main effect of polarity, but find support for the idea that our manipulation of AnswerCertainty induced gradually larger belief changes. In sum, it seems that the stimuli adequately implement the intended manipulations of AnswerCertainty and AnswerPolarity.

Predicting relevance in terms of the experimental factors

We want to explore how relevance ratings depend on the experimental manipulations. First, we check whether the group variable (the trigger word: ‘helpful’ vs ‘relevant’) is important. If not, we can simplify subsequent analyses.

Next, we investigate the effects of variables like AnswerCertainty, AnswerPolarity, and ContextType on relevance ratings.

Can we gloss over the different trigger words?

To simplify analyses, it would be helpful to know whether we can gloss over the group manipulation. So, does it matter whether participants were asked to rate relevance or helpfulness?

To start with, let’s just look descriptively at whether there is a main effect of group on the relevance ratings; there does not seem to be much of one (possibly also partially explained away by by-subject random slopes).
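The descriptive plot for this step is not included above; a minimal sketch of such a by-group comparison, simply overlaying the rating densities for the two trigger words, could look as follows.

Code
# Sketch only: overlay relevance-rating densities for the two trigger-word groups
d |> ggplot(aes(x = relevance_sliderResponse, color = group, fill = group)) +
  geom_density(alpha = 0.3) +
  xlab("relevance rating") +
  ylab("")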

To further investigate this contrast, we fit two ordered beta regression models, one with and one without the group factor. We check whether there is a credible main effect of group in the full model and whether there is a substantial difference in LOO score when comparing the two models. The former is for fun, the latter determines whether we should lump trigger words together.

Code
if (rerun_models) {
  
  # TODO too small ESS
  fit_with_group_ordBeta <- ordbetareg::ordbetareg(
    # must omit interactions in the REs to ensure proper fit
    formula = relevance_sliderResponse ~ group * ContextType * AnswerCertainty * AnswerPolarity + 
      (1 + group + ContextType + AnswerCertainty + AnswerPolarity || StimID) + 
      (1 + ContextType + AnswerCertainty + AnswerPolarity || submission_id),
    data = d,
    # set this prior, otherwise there are errors
    coef_prior_SD = 5,
    save_pars = save_pars(all=T)
  )
  
  fit_without_group_ordBeta <- ordbetareg::ordbetareg(
    relevance_sliderResponse ~ ContextType * AnswerCertainty * AnswerPolarity +
      (1 + ContextType + AnswerCertainty + AnswerPolarity || StimID) + 
      (1 + ContextType + AnswerCertainty + AnswerPolarity || submission_id),
    data = d,
    coef_prior_SD = 5,
    save_pars = save_pars(all=TRUE)
  )
  saveRDS(fit_with_group_ordBeta, "cachedModels-round2/fit_with_group_ordBeta.Rds")
  saveRDS(fit_without_group_ordBeta, "cachedModels-round2/fit_without_group_ordBeta.Rds")
} else {
  fit_with_group_ordBeta    <- read_rds("cachedModels-round2/fit_with_group_ordBeta.Rds")
  fit_without_group_ordBeta <- read_rds("cachedModels-round2/fit_without_group_ordBeta.Rds")
}

We inspect whether there is a main effect of group in the full model:

Code
# main effect of "group" ?
group_main <- compare_groups(
  fit_with_group_ordBeta,
  higher = group == "relevant",
  lower  = group == "helpful"
)
print(group_main)
Outcome of comparing groups: 
 * higher:  group == "relevant" 
 * lower:   group == "helpful" 
Mean 'higher - lower':  0.2069 
95% HDI:  [ 0.07441 ; 0.3409 ]
P('higher - lower' > 0):  0.9978 
Posterior odds:  443.4 

The posterior difference between the ‘relevant’ and the ‘helpful’ group is credibly bigger than zero.

But that is not the best criterion to decide whether the group distinction is relevant for predictions. We therefore also compare models with LOO cross-validation:

Code
if (rerun_models) {
  loo_comp <- loo::loo_compare(
    list("w/__group" = loo(fit_with_group_ordBeta),
                           # moment_match = TRUE, 
                           # reloo = TRUE),
         "w/o_group" = loo(fit_without_group_ordBeta)))
                           # moment_match = TRUE,
                           # reloo = TRUE)))
  saveRDS(loo_comp, "cachedModels-round2/loo_comp_groups.Rds")
} else {
  loo_comp = read_rds("cachedModels-round2/loo_comp_groups.Rds")
}
loo_comp
          elpd_diff se_diff
w/o_group  0.0       0.0   
w/__group -2.8       6.9   
Code
Lambert_test(loo_comp)
[1] 0.9999762

It appears that, when comparing these models with REs, the lesioned model is numerically even slightly better than the model including group, but not by a substantial margin. From a predictive point of view, including group does not add substantial value. We will therefore lump trigger words together in the following.

Effect of AnswerPolarity, AnswerCertainty and ContextType on relevance ratings

To investigate further which experimental factors influence the relevance ratings of an answer, we start with a visualization:

Code
d |> 
  mutate(AnswerCertainty = case_when(AnswerCertainty == "low_certainty" ~ "low",
                                     AnswerCertainty == "high_certainty" ~ "high",
                                     AnswerCertainty == "non_answer" ~ "non-answer",
                                     TRUE ~ "exhaustive") |> 
           factor(levels = c("non-answer", "low", "high", "exhaustive"))) |> 
  ggplot(aes(x = relevance_sliderResponse, color = AnswerPolarity, fill = AnswerPolarity)) +
  facet_grid(AnswerCertainty ~ ContextType , scales = "free") +
  geom_density(alpha = 0.2, linewidth = 1.5) +
  xlab("relevance rating") + ylab("")

Code
ggsave("plots/relevance-ratings.pdf", scale=1.2, width = 10, height=6)

Let’s now look at a bunch of contrasts (based on the previously fitted full model).

Code
# fit_to_use = fit
fit_to_use = fit_without_group_ordBeta

## expected ordering relation?
## non-answers vs low-certainty => posterior = 1
nonAns_VS_low  <- compare_groups(
  fit_to_use,
  lower  = AnswerCertainty == "non_answer",
  higher = AnswerCertainty == "low_certainty"
)
## low-certainty vs high-certainty => posterior = 0.9922
low_VS_high <- compare_groups(
  fit_to_use,
  lower  = AnswerCertainty == "low_certainty",
  higher = AnswerCertainty == "high_certainty"
)
## high-certainty vs exhaustive => posterior = 1
high_VS_exh <- compare_groups(
  fit_to_use,
  lower  = AnswerCertainty == "high_certainty",
  higher = AnswerCertainty == "exhaustive"
)


## effects of AnswerPolarity?
AnswerPolarity_main <- compare_groups(
  fit_to_use,
  lower  = AnswerPolarity == "positive" & AnswerCertainty != "non_answer",
  higher = AnswerPolarity == "negative" & AnswerCertainty != "non_answer"
)

AnswerPolarity_lowCertain <- compare_groups(
  fit_to_use,
  lower  = AnswerPolarity == "positive" & AnswerCertainty == "low_certainty",
  higher = AnswerPolarity == "negative" & AnswerCertainty == "low_certainty"
)

AnswerPolarity_highCertain <-compare_groups(
  fit_to_use,
  lower  = AnswerPolarity == "positive" & AnswerCertainty == "high_certainty",
  higher = AnswerPolarity == "negative" & AnswerCertainty == "high_certainty"
)

AnswerPolarity_exhaustive <-compare_groups(
  fit_to_use,
  lower  = AnswerPolarity == "positive" & AnswerCertainty == "exhaustive",
  higher = AnswerPolarity == "negative" & AnswerCertainty == "exhaustive"
)

ContextType_neutral_negative <- 
  compare_groups(fit_to_use, higher = ContextType == "neutral", lower = ContextType == "negative")

ContextType_neutral_positive <- 
  compare_groups(fit_to_use, higher = ContextType == "neutral", lower = ContextType == "positive")


cellComparisons <- tribble(
  ~comparison, ~measure, ~posterior, ~"HDI (low)", ~"HDI (high)",
  "non-answer < low certainty" , "relevance", nonAns_VS_low$post_prob, nonAns_VS_low$l_ci, nonAns_VS_low$u_ci,  
  "low certain < high certain" , "relevance", low_VS_high$post_prob, low_VS_high$l_ci, low_VS_high$u_ci,  
  "high certain < exhaustive" , "relevance", high_VS_exh$post_prob, high_VS_exh$l_ci, high_VS_exh$u_ci,  
  "answer: pos < neg", "relevance", AnswerPolarity_main$post_prob, AnswerPolarity_main$l_ci, AnswerPolarity_main$u_ci,
  "context: neutral > pos", "relevance", ContextType_neutral_positive$post_prob, ContextType_neutral_positive$l_ci, ContextType_neutral_positive$u_ci,
  "context: neutral > neg", "relevance", ContextType_neutral_negative$post_prob, ContextType_neutral_negative$l_ci, ContextType_neutral_negative$u_ci
)

write_csv(cellComparisons,
          "R_data_4_TeX/results-relevance-ratings.csv", col_names = T)

knitr::kable(cellComparisons)
comparison                   measure     posterior   HDI (low)    HDI (high)
non-answer < low certainty   relevance   1.00000      1.8158322   2.4952422
low certain < high certain   relevance   0.99850      0.2082523   0.9073744
high certain < exhaustive    relevance   1.00000      0.9603053   1.8931582
answer: pos < neg            relevance   0.42625     -0.3057378   0.2491692
context: neutral > pos       relevance   0.97925     -0.0037564   0.2987856
context: neutral > neg       relevance   0.87575     -0.0603201   0.2136360

The table shows (unsurprising) effects of AnswerCertainty: non-answers are rated as least relevant, followed by low-certainty, then high-certainty, and finally exhaustive answers. There is no (strong) indication of a main effect of AnswerPolarity or ContextType. The lack of an effect of ContextType might be interpreted as prima facie evidence in favor of quantitative notions of relevance that do not take the prior into account (at least not very prominently).

Here is a plot of the relevant posterior draws, visually supporting why we compared the three factor levels of ContextType in the way we did: ‘positive’ is lowest and ‘neutral’ highest, but even this difference is not strongly supported (0 is included in the HDI):

Code
draws_ContextType <- 
  tibble(
    positive = filter_cell_draws(fit_to_use, ContextType == "positive", colname = "positive")$positive,
    negative = filter_cell_draws(fit_to_use, ContextType == "negative", colname = "negative")$negative,
    neutral  = filter_cell_draws(fit_to_use, ContextType == "neutral",  colname = "neutral" )$neutral
  ) |> pivot_longer(cols = everything())

draws_ContextType |> 
  ggplot(aes(x = value, color = name, fill = name)) +
  geom_density(alpha = 0.3)

Addressing the main research hypotheses

Research hypotheses 1 and 2 are basic predictions in terms of simple measures of first- and second-order belief change. Research hypothesis 3 is about different notions of quantifying informational relevance.

Hypothesis 1: first-order belief change explains relevance ratings

The hypothesis is that higher belief changes (induced by the answer) lead to higher relevance ratings. We test this hypothesis with an ordered beta regression model (with maximal random effects) that regresses relevance ratings against the absolute difference between prior and posterior ratings (first_order_belief_change). We judge there to be evidence in favor of this hypothesis if the relevant slope coefficient is estimated to be credibly bigger than zero (posterior probability > 0.944, an arbitrary value chosen to signal that there is nothing special about 0.95) and if a LOO-based model comparison with an intercept-only model substantially favors the model that includes the relevant slope.
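The predictors first_order_belief_change and second_order_belief_change are part of the preprocessed data. As a sketch of their intended meaning only (absolute change in the slider rating and in the commitment rating, respectively), they could be derived like this; the actual preprocessing may differ in detail.

Code
# Sketch only: these columns come from the Python preprocessing; here we merely
# illustrate their intended definitions as absolute prior-posterior differences.
d_check <- d |>
  mutate(
    first_order_belief_change_check  = abs(posterior_sliderResponse - prior_sliderResponse),
    second_order_belief_change_check = abs(posterior_confidence - prior_confidence)
  )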

Run the regression model:

Code
if (rerun_models) {
  
  fit_belief_diff <- ordbetareg(
    formula = relevance_sliderResponse ~ first_order_belief_change + 
      (1 + first_order_belief_change | submission_id) + (1 + first_order_belief_change | StimID),
    data = d
  )
  
  # TODO check how to run an intercept-only model in ordbetareg;
  # the formula below is the standard intercept-only specification
  fit_belief_diff_interceptOnly <- ordbetareg(
    formula = relevance_sliderResponse ~ 1,
    data = d
  )
  
  write_rds(fit_belief_diff, "cachedModels-round2/fit_belief_diff.rds")
  write_rds(fit_belief_diff_interceptOnly, "cachedModels-round2/fit_belief_diff_interceptOnly.rds")
  
} else {
  fit_belief_diff <- read_rds("cachedModels-round2/fit_belief_diff.rds")
  fit_belief_diff_interceptOnly <- read_rds("cachedModels-round2/fit_belief_diff_interceptOnly.rds")
}

Test hypothesis 1:

Code
hyp1_summary <- brms::hypothesis(fit_belief_diff, "first_order_belief_change > 0")
loo_comp_hyp1 <- loo_compare(
  list("w/__beliefDiff" = loo(fit_belief_diff),
       "w/o_beliefDiff" = loo(fit_belief_diff_interceptOnly)))
loo_comp_hyp1
               elpd_diff se_diff
w/__beliefDiff     0.0       0.0
w/o_beliefDiff -2235.8      57.4
Code
# Lambert's z-score test
Lambert_test(loo_comp_hyp1)
[1] 0
Code
post_samples <- hyp1_summary$samples |> pull(H1)
HDIs <- HDInterval::hdi(post_samples, credMass = 0.944)

myvars["hyp1PosteriorMean"] <- mean(post_samples) |> round(3)
myvars["hyp1PosteriorHypothesis"] <- hyp1_summary$hypothesis[1,"Post.Prob"]
myvars["hyp1HDILow"] <- HDIs['lower'] |> round(3)
myvars["hyp1HDIHigh"] <- HDIs['upper'] |> round(3)
myvars["hyp1LOODiff"] <- -1* (loo_comp_hyp1[2,1] |> round(3))
myvars["hyp1LOOSE"] <- loo_comp_hyp1[2,2] |> round(3)
myvars["hyp1LOOPValue"] <- Lambert_test(loo_comp_hyp1) |> round(3)

We find support for hypothesis one.

Hypothesis 2: commitment change additionally contributes to relevance rating

We also hypothesize that the change in commitment ratings (second_order_belief_change) additionally contributes to predicting relevance ratings. Concretely, we address this hypothesis with an ordered beta regression model like the one for hypothesis 1, but additionally including the absolute difference in commitment ratings before and after the answer (and the interaction term). We use the maximal RE structure. We speak of evidence in favor of this hypothesis if the relevant posterior slope parameter is credibly bigger than zero and a LOO-based model comparison favors the more complex model. We speak of evidence against this hypothesis if the LOO-based model comparison favors the simpler model.

Run the regression model:

Code
if (rerun_models) {
  
  fit_confidence_diff <- ordbetareg(
    formula = relevance_sliderResponse ~ first_order_belief_change * second_order_belief_change + 
      (1 + first_order_belief_change * second_order_belief_change | submission_id) + 
      (1 + first_order_belief_change * second_order_belief_change | StimID),
    data = d
  )
  
  write_rds(fit_confidence_diff, "cachedModels-round2/fit_confidence_diff.rds")
  
} else {
  fit_confidence_diff <- read_rds("cachedModels-round2/fit_confidence_diff.rds")
}

Test Hypothesis 2:

Code
hyp2_summary <- brms::hypothesis(fit_confidence_diff, "second_order_belief_change > 0")
loo_comp_hyp2 <- loo_compare(list("w/__confidence" = loo(fit_confidence_diff),
                             "w/o_confidence" = loo(fit_belief_diff)))
loo_comp_hyp2
               elpd_diff se_diff
w/__confidence   0.0       0.0  
w/o_confidence -53.5      11.1  
Code
# Lambert's z-score test
Lambert_test(loo_comp_hyp2)
[1] 0
Code
post_samples <- hyp2_summary$samples |> pull(H1)
HDIs <- HDInterval::hdi(post_samples, credMass = 0.944)

myvars["hyp2PosteriorMean"] <- mean(post_samples) |> round(3)
myvars["hyp2PosteriorHypothesis"] <- hyp2_summary$hypothesis[1,"Post.Prob"]
myvars["hyp2HDILow"] <- HDIs['lower'] |> round(3)
myvars["hyp2HDIHigh"] <- HDIs['upper'] |> round(3)
myvars["hyp2LOODiff"] <- -1* (loo_comp_hyp2[2,1] |> round(3))
myvars["hyp2LOOSE"] <- loo_comp_hyp2[2,2] |> round(3)
myvars["hyp2LOOPValue"] <- Lambert_test(loo_comp_hyp2) |> round(3)

We find support for hypothesis two.

Hypothesis 3: “Bayes Factor utility” is the best single-factor predictor of relevance ratings

The third hypothesis is that bayes_factor_utility is a better (single-factor, linear) predictor of relevance_sliderResponse than kl_utility and entropy_change. We address this hypothesis with LOO cross-validation. We also directly include the predictors from hypotheses 1 and 2 here, thus comparing all single-factor models.
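The utility measures themselves (entropy_change, kl_utility, bayes_factor_utility) are computed during the Python preprocessing and are not re-derived here. Purely as an illustration of the kind of quantity involved, for a binary issue with prior rating p and posterior rating q they could be operationalized as follows; these formulas are assumptions for illustration, not the project’s actual definitions.

Code
# Illustration only: one plausible operationalization of the three utilities
# for a binary issue; p and q are assumed to lie strictly between 0 and 1.
entropy_bern <- function(p) -p * log(p) - (1 - p) * log(1 - p)

relevance_measures <- function(p, q) {
  tibble(
    entropy_change       = entropy_bern(p) - entropy_bern(q),
    kl_utility           = q * log(q / p) + (1 - q) * log((1 - q) / (1 - p)),
    bayes_factor_utility = abs(log((q / (1 - q)) / (p / (1 - p))))
  )
}

# example: prior rating 0.5, posterior rating 0.8
relevance_measures(0.5, 0.8)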

Code
get_single_factor_formula <- function(factor) {
  # get formula for single factor with full RE structure
  brms::brmsformula(
    str_c("relevance_sliderResponse ~ (1 + ", factor, " | submission_id)",
    "+ (1 + ", factor, " | StimID) + ", factor))
}

if (rerun_models) {
  
  # ER
  fit_loo_ER <- ordbetareg(
    get_single_factor_formula("entropy_change"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "ER", moment_match = T)
  
  # KL
  fit_loo_KL <- ordbetareg(
    get_single_factor_formula("kl_utility"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "KL")
  
  # BF
  fit_loo_BF <- ordbetareg(
    get_single_factor_formula("bayes_factor_utility"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "BF")
  
  # ER_beta
  fit_loo_ER_beta <- ordbetareg(
    get_single_factor_formula("beta_entropy_change"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "ER_beta")
  
  # KL_beta
  fit_loo_KL_beta <- ordbetareg(
    get_single_factor_formula("beta_kl_utility"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "KL_beta")
  
  # BF_beta
  fit_loo_BF_beta <- ordbetareg(
    get_single_factor_formula("beta_bayes_factor_utility"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "BF_beta")
  
  # belief_diff 
  fit_loo_belief_diff <- ordbetareg(
    get_single_factor_formula("first_order_belief_change"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "belief_diff")
  
  # commitment_diff 
  fit_loo_confidence_diff <- ordbetareg(
    get_single_factor_formula("second_order_belief_change"),
    iter = 6000,
    save_pars = save_pars(all = T),
    data = d,
    control = list(adapt_delta = 0.9)
  ) |> add_criterion("loo", model_name = "confidence_diff")
  
  write_rds(fit_loo_ER, "cachedModels-round2/fit_loo_ER.rds")
  write_rds(fit_loo_KL, "cachedModels-round2/fit_loo_KL.rds")
  write_rds(fit_loo_BF, "cachedModels-round2/fit_loo_BF.rds")
  write_rds(fit_loo_ER_beta, "cachedModels-round2/fit_loo_ER_beta.rds")
  write_rds(fit_loo_KL_beta, "cachedModels-round2/fit_loo_KL_beta.rds")
  write_rds(fit_loo_BF_beta, "cachedModels-round2/fit_loo_BF_beta.rds")
  write_rds(fit_loo_belief_diff, "cachedModels-round2/fit_loo_belief_diff.rds")
  write_rds(fit_loo_confidence_diff, "cachedModels-round2/fit_loo_confidence_diff.rds")
  
} else{
  fit_loo_ER <- read_rds("cachedModels-round2/fit_loo_ER.rds")
  fit_loo_KL <- read_rds("cachedModels-round2/fit_loo_KL.rds")
  fit_loo_BF <- read_rds("cachedModels-round2/fit_loo_BF.rds")
  fit_loo_ER_beta <- read_rds("cachedModels-round2/fit_loo_ER_beta.rds")
  fit_loo_KL_beta <- read_rds("cachedModels-round2/fit_loo_KL_beta.rds")
  fit_loo_BF_beta <- read_rds("cachedModels-round2/fit_loo_BF_beta.rds")
  fit_loo_belief_diff <- read_rds("cachedModels-round2/fit_loo_belief_diff.rds")
  fit_loo_confidence_diff <- read_rds("cachedModels-round2/fit_loo_confidence_diff.rds")
}

LOO-based model comparison:

Code
loo_compare(
 fit_loo_ER, 
 fit_loo_KL, 
 fit_loo_BF,
 fit_loo_ER_beta, 
 fit_loo_KL_beta, 
 fit_loo_BF_beta,
 fit_loo_belief_diff,
 fit_loo_confidence_diff
)
                        elpd_diff se_diff
fit_loo_BF                 0.0       0.0 
fit_loo_BF_beta         -168.2      37.0 
fit_loo_ER              -179.4      29.6 
fit_loo_KL_beta         -185.8      18.7 
fit_loo_KL              -325.8      26.2 
fit_loo_belief_diff     -357.9      29.6 
fit_loo_confidence_diff -559.3      37.0 
fit_loo_ER_beta         -695.1      37.8 
Code
model_names <- c("belief diff.", "entropy change", "KL utility", "BF utility", 
                 "confidence diff.", "entropy change (beta)", 
                 "KL utility (beta)", "BF utility (beta)")
tibble(
  model = factor(model_names, levels = rev(model_names)),
  level = rep(c("first-order", "second-order"), each = 4),
  ELPD  = c(
    loo(fit_loo_belief_diff)$estimates['elpd_loo','Estimate'],
    loo(fit_loo_ER)$estimates['elpd_loo','Estimate'], 
    loo(fit_loo_KL)$estimates['elpd_loo','Estimate'],
    loo(fit_loo_BF)$estimates['elpd_loo','Estimate'],
    loo(fit_loo_confidence_diff)$estimates['elpd_loo','Estimate'],
    loo(fit_loo_ER_beta)$estimates['elpd_loo','Estimate'],  
    loo(fit_loo_KL_beta)$estimates['elpd_loo','Estimate'], 
    loo(fit_loo_BF_beta)$estimates['elpd_loo','Estimate']
  ),
  SE = c(
    loo(fit_loo_belief_diff)$estimates['elpd_loo','SE'],
    loo(fit_loo_ER)$estimates['elpd_loo','SE'], 
    loo(fit_loo_KL)$estimates['elpd_loo','SE'],
    loo(fit_loo_BF)$estimates['elpd_loo','SE'],
    loo(fit_loo_confidence_diff)$estimates['elpd_loo','SE'],
    loo(fit_loo_ER_beta)$estimates['elpd_loo','SE'],  
    loo(fit_loo_KL_beta)$estimates['elpd_loo','SE'], 
    loo(fit_loo_BF_beta)$estimates['elpd_loo','SE']
  ),
  lower = ELPD - SE,
  upper = ELPD + SE
) |> 
  ggplot(aes(x = model, y = ELPD, color = level)) +
  geom_linerange(aes(ymin = lower, ymax = upper), linewidth = 0.5) +
  geom_point(aes(shape = level), size = 2.5) +
  coord_flip() +
  xlab("") +
  ylab("expected log predictive density (LOO-CV)")

Code
ggsave("plots/comparison-single-factors.pdf", scale = 0.8, width = 8, height = 4)

Testing whether there is a substantial difference between the two best models:

Code
loo_comp_hyp3 <- loo_compare(fit_loo_BF, fit_loo_BF_beta)
loo_comp_hyp3
                elpd_diff se_diff
fit_loo_BF         0.0       0.0 
fit_loo_BF_beta -168.2      37.0 
Code
Lambert_test(loo_comp_hyp3)
[1] 0
Code
# Comparing the best model to the second best model
myvars["hyp3BFvsBFBetaLOODiff"] <- -1* (loo_comp_hyp3[2,1] |> round(3))
myvars["hyp3BFvsBFBetaLOOSE"] <- loo_comp_hyp3[2,2] |> round(3)
myvars["hyp3BFvsBFBetaPValue"] <- Lambert_test(loo_comp_hyp3) |> round(3)

Yes, there is a noteworthy difference.

We can conclude that first-order BF is the single best predictor of the relevance ratings.


Now we want to visualize all results:

Code
library(viridis)
Code
d |> 
    ggplot(aes(x = prior_sliderResponse, y = posterior_sliderResponse, color = relevance_sliderResponse))  +
    geom_point(size = 3, alpha = 0.5) + 
    scale_color_viridis()

And let’s zoom in on some interesting spots:

Code
d |> 
    ggplot(aes(x = prior_sliderResponse, y = posterior_sliderResponse, color = relevance_sliderResponse))  +
    geom_jitter(width = 0.01, height = 0.01, size = 3, alpha = 0.7) + 
    scale_color_viridis() + 
    coord_cartesian(xlim = c(0, 0.25), ylim = c(0, 0.25))

Note: redo these plots but show delta of relevance_sliderResponse and BF etc.

And we do the same thing with the commitment ratings. TODO: sanity-check whether conditions with a strongly biased prior and a contradictory answer lead to a decrease in commitment.

Code
d |> 
    ggplot(aes(x = prior_confidence, y = posterior_confidence, color = relevance_sliderResponse))  +
    geom_jitter(width = 0.4, height = 0.4, size = 3, alpha = 0.4) + 
    scale_color_viridis()

And now we can plot the change in commitment and the change in probability as x and y. TODO: try removing non-answers from these (and other) plots.

Code
d |> 
    ggplot(aes(x = first_order_belief_change, y = second_order_belief_change, color = relevance_sliderResponse))  +
    geom_jitter(width = 0.0, height = 0.3, size = 3, alpha = 0.4) + 
    scale_color_viridis()

Code
d |> 
    ggplot(aes(x = first_order_belief_change, y = second_order_belief_change, color = relevance_sliderResponse))  +
    geom_jitter(width = 0.01, height = 0.3, size = 3, alpha = 0.4) + 
    scale_color_viridis() + 
    coord_cartesian(xlim = c(-0.01, 0.15), ylim = c(-0.3, 4))

Addressing the exploratory hypotheses

Exploratory Hypothesis 2: adding pure_second_order_belief_change to all first-order measures

To complement hypothesis 2, we also test whether adding another measure of higher-order uncertainty change improves predictive performance for each first-order measure of belief change. So here we compare, for each first-order belief-change measure X (“entropy change”, “KL”, and “Bayes factor”), whether adding the factor pure_second_order_belief_change increases predictive performance. Concretely, we compare a model with the single factor X as a predictor to a model with predictors X, pure_second_order_belief_change, and their interaction. For ease of fitting, no random effects are included.

First, run the relevant models.

Code
get_formula_two_factors <- function(factor1, factor2) {
  # brms::brmsformula(
  #   str_c(
  #     "relevance_sliderResponse ~ (1 + ", factor1, " + " ,  factor2, " | submission_id)",
  #     "+ (1 + ", factor1, " + " , factor2, " | StimID) + ", 
  #     factor1, " * ", factor2))
  if (factor2 == "") {
    brms::brmsformula(
      str_c(
        "relevance_sliderResponse ~ ", factor1))  
  } else {
    brms::brmsformula(
    str_c(
      "relevance_sliderResponse ~ ", factor1, " * " ,  factor2))
  }
}

# get_formula_two_factors("X", "Y")
# get_formula_two_factors("X", "")

fit_model_two_factors <- function(name, factor1, factor2) {
  filename <- paste0(c("cachedModels-round2/", name, ".rds"), collapse = "")
  if (rerun_models) {
    x <- ordbetareg(
      get_formula_two_factors(factor1, factor2),
      iter = 2000,
      save_pars = save_pars(all = T),
      data = d,
      control = list(adapt_delta = 0.9)
    ) |> add_criterion("loo", model_name = name, moment_match = T)
    write_rds(x, filename) 
    return(x)
  } else{
    read_rds(filename) 
  }
  
}

# ER
fit_expHyp2_ER <- fit_model_two_factors(
  "fit_expHyp2_ER", 
  "entropy_change", 
  "")
fit_expHyp2_ER_pure <- fit_model_two_factors(
  "fit_expHyp2_ER_pure", 
  "entropy_change", 
  "pure_second_order_belief_change")


#KL
fit_expHyp2_KL <- fit_model_two_factors(
  "fit_expHyp2_KL", 
  "kl_utility", 
  "")
fit_expHyp2_KL_pure <- fit_model_two_factors(
  "fit_expHyp2_KL_pure", 
  "kl_utility", 
  "pure_second_order_belief_change")


#BF
fit_expHyp2_BF <- fit_model_two_factors(
  "fit_expHyp2_BF", 
  "bayes_factor_utility", 
  "")
fit_expHyp2_BF_pure <- fit_model_two_factors(
  "fit_expHyp2_BF_pure", 
  "bayes_factor_utility", 
  "pure_second_order_belief_change")

Then compare the relevant pairs with LOO:

Code
(loo_c_explHyp2_ER <- loo_compare(fit_expHyp2_ER, fit_expHyp2_ER_pure))
                    elpd_diff se_diff
fit_expHyp2_ER_pure   0.0       0.0  
fit_expHyp2_ER      -63.7      11.8  
Code
(loo_c_explHyp2_KL <- loo_compare(fit_expHyp2_KL, fit_expHyp2_KL_pure))
                    elpd_diff se_diff
fit_expHyp2_KL_pure    0.0       0.0 
fit_expHyp2_KL      -168.4      18.6 
Code
(loo_c_explHyp2_BF <- loo_compare(fit_expHyp2_BF, fit_expHyp2_BF_pure))
                    elpd_diff se_diff
fit_expHyp2_BF_pure   0.0       0.0  
fit_expHyp2_BF      -85.1      11.8  

It appears that for all three measures of first-order belief change, adding pure_second_order_belief_change yields a substantially better model.

Code
myvars["expHyp2ERLOODiff"] <- -1* (loo_c_explHyp2_ER[2,1] |> round(3))
myvars["expHyp2ERLOOSE"]   <- loo_c_explHyp2_ER[2,2] |> round(3)
  
myvars["expHyp2KLLOODiff"] <-  -1* (loo_c_explHyp2_KL[2,1] |> round(3))
myvars["expHyp2KLLOOSE"]   <-  loo_c_explHyp2_KL[2,2] |> round(3)

myvars["expHyp2BFLOODiff"] <-  -1* (loo_c_explHyp2_BF[2,1] |> round(3))
myvars["expHyp2BFLOOSE"]   <-    loo_c_explHyp2_BF[2,2] |> round(3)

Exploratory Hypothesis 3: compare all combinations of first- and second-order measures

Finally, we just compare models with all combinations of first- and second-order measures. Questions of interest are:

  1. Which arbitrary combination of first- and second-order measures is the best?
  2. Does it matter to be consistent in the choice of first- and second order measure, i.e., is the performance of “first-order X” always most boosted when we supply it with “second-order X” instead of some other “second-order Y”?

Let’s run the models first. For ease of fitting, no random effects are included.

Code
# base-level: pure

fit_loo_pure_ER <- fit_model_two_factors(
  "fit_loo_pure_ER",  "first_order_belief_change",  "beta_entropy_change")

fit_loo_pure_KL <- fit_model_two_factors(
  "fit_loo_pure_KL",  "first_order_belief_change",  "beta_kl_utility")

fit_loo_pure_BF <- fit_model_two_factors(
  "fit_loo_pure_BF",  "first_order_belief_change",  "beta_bayes_factor_utility")

fit_loo_pure_pure <- fit_model_two_factors(
  "fit_loo_pure_pure",  "first_order_belief_change",  "pure_second_order_belief_change")

# base-level: ER

fit_loo_ER_ER <- fit_model_two_factors(
  "fit_loo_ER_ER",  "entropy_change",  "beta_entropy_change")

fit_loo_ER_KL <- fit_model_two_factors(
  "fit_loo_ER_KL",  "entropy_change",  "beta_kl_utility")

fit_loo_ER_BF <- fit_model_two_factors(
  "fit_loo_ER_BF",  "entropy_change",  "beta_bayes_factor_utility")

fit_loo_ER_pure <- fit_model_two_factors(
  "fit_loo_ER_pure",  "entropy_change",  "pure_second_order_belief_change")

# base-level: KL

fit_loo_KL_ER <- fit_model_two_factors(
  "fit_loo_KL_ER",  "kl_utility",  "beta_entropy_change")

fit_loo_KL_KL <- fit_model_two_factors(
  "fit_loo_KL_KL",  "kl_utility",  "beta_kl_utility")

fit_loo_KL_BF <- fit_model_two_factors(
  "fit_loo_KL_BF",  "kl_utility",  "beta_bayes_factor_utility")

fit_loo_KL_pure <- fit_model_two_factors(
  "fit_loo_KL_pure",  "kl_utility",  "pure_second_order_belief_change")

# base-level: BF

fit_loo_BF_ER <- fit_model_two_factors(
  "fit_loo_BF_ER",  "bayes_factor_utility",  "beta_entropy_change")

fit_loo_BF_KL <- fit_model_two_factors(
  "fit_loo_BF_KL",  "bayes_factor_utility",  "beta_kl_utility")

fit_loo_BF_BF <- fit_model_two_factors(
  "fit_loo_BF_BF",  "bayes_factor_utility",  "beta_bayes_factor_utility")

fit_loo_BF_pure <- fit_model_two_factors(
  "fit_loo_BF_pure",  "bayes_factor_utility",  "pure_second_order_belief_change")

Compare all models.

Code
loo_compare(
 fit_loo_pure_ER, 
 fit_loo_pure_KL, 
 fit_loo_pure_BF, 
 fit_loo_pure_pure,
 fit_loo_ER_ER, 
 fit_loo_ER_KL, 
 fit_loo_ER_BF, 
 fit_loo_ER_pure,
 fit_loo_KL_ER, 
 fit_loo_KL_KL, 
 fit_loo_KL_BF,
 fit_loo_KL_pure,
 fit_loo_BF_ER, 
 fit_loo_BF_KL, 
 fit_loo_BF_BF,
 fit_loo_BF_pure
)
                  elpd_diff se_diff
fit_loo_BF_BF        0.0       0.0 
fit_loo_BF_pure    -37.7      11.9 
fit_loo_ER_BF     -103.2      21.6 
fit_loo_BF_ER     -122.9      15.3 
fit_loo_BF_KL     -124.3      15.6 
fit_loo_ER_KL     -136.3      19.3 
fit_loo_KL_BF     -171.6      19.3 
fit_loo_pure_BF   -182.3      21.5 
fit_loo_ER_pure   -238.6      28.9 
fit_loo_pure_KL   -266.7      19.9 
fit_loo_KL_KL     -267.1      18.6 
fit_loo_pure_pure -281.3      29.1 
fit_loo_KL_pure   -282.3      27.2 
fit_loo_ER_ER     -302.1      31.4 
fit_loo_KL_ER     -450.7      30.1 
fit_loo_pure_ER   -493.8      33.4 

Plot results.

Code
model_names <- c(
 "pure + ER   ", "pure + KL   ", "pure + BF   ", "pure + pure ",
 "  ER + ER   ", "  ER + KL   ", "  ER + BF   ", "  ER + pure ",
 "  KL + ER   ", "  KL + KL   ", "  KL + BF   ", "  KL + pure ",
 "  BF + ER   ", "  BF + KL   ", "  BF + BF   ", "  BF + pure "
  )



# model_names <- c(
#   "1st: ER  2nd: ER  ", 
#   "1st: ER  2nd: KL  ", 
#   "1st: ER  2nd: BF  ", 
#   "1st: ER  2nd: pure",
#   "1st: KL  2nd: ER  ", 
#   "1st: KL  2nd: KL  ", 
#   "1st: KL  2nd: BF  ", 
#   "1st: KL  2nd: pure",
#   "1st: BF  2nd: ER  ", 
#   "1st: BF  2nd: KL  ", 
#   "1st: BF  2nd: BF  ", 
#   "1st: BF  2nd: pure"
#   )

tibble(
  model = factor(model_names, levels = rev(model_names)),
  ELPD  = c(
    loo(fit_loo_pure_ER)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_pure_KL)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_pure_BF)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_pure_pure)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_ER_ER)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_ER_KL)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_ER_BF)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_ER_pure)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_KL_ER)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_KL_KL)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_KL_BF)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_KL_pure)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_BF_ER)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_BF_KL)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_BF_BF)$estimates['elpd_loo', 'Estimate'],
    loo(fit_loo_BF_pure)$estimates['elpd_loo', 'Estimate']
  ),
  SE = c(
    loo(fit_loo_pure_ER)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_pure_KL)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_pure_BF)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_pure_pure)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_ER_ER)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_ER_KL)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_ER_BF)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_ER_pure)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_KL_ER)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_KL_KL)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_KL_BF)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_KL_pure)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_BF_ER)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_BF_KL)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_BF_BF)$estimates['elpd_loo', 'SE'],
    loo(fit_loo_BF_pure)$estimates['elpd_loo', 'SE']
  ),
  lower = ELPD - SE,
  upper = ELPD + SE
) |> 
  ggplot(aes(x = model, y = ELPD)) +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  coord_flip() +
  xlab("") +
  ylab("expected log predictive density (LOO-CV)") +
  theme(axis.text.y = element_text(family = "mono"))

Code
ggsave("plots/comparison-exploratory-hyp3.pdf", scale = 1, width = 8, height = 4)
Code
(loo_comp_explHyp3 <- loo_compare(loo(fit_loo_BF_BF), loo(fit_loo_BF_pure)))
                elpd_diff se_diff
fit_loo_BF_BF     0.0       0.0  
fit_loo_BF_pure -37.7      11.9  
Code
myvars["explHyp3LOODiff"] <- -1* (loo_comp_explHyp3[2,1] |> round(3))
myvars["explHyp3LOOSE"] <- loo_comp_explHyp3[2,2] |> round(3)
myvars["explHyp3PValue"] <- Lambert_test(loo_comp_explHyp3) |> round(3)

It seems that the overall best model in this comparison is the one that uses Bayes-factor based measures consistently. The second-order Bayes-factor based measure also seems to be the best one to add to each of the other first-order measures. This also means that it is not the case that being consistent in the choice of first- and second-order measure is always best.

Saving variables for LaTeX
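The export step itself is not shown above; a minimal sketch, writing the collected myvars entries to the rl_file path defined at the top of this document, could look as follows.

Code
# Sketch only: turn the named list into a two-column table and write it to disk
enframe(myvars, name = "name", value = "value") |>
  mutate(value = map_chr(value, as.character)) |>
  write_csv(rl_file)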
